-
Notifications
You must be signed in to change notification settings - Fork 379
Metadata Log Entries metadata table #667
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Metadata Log Entries metadata table #667
Conversation
pyiceberg/table/snapshots.py
Outdated
parent_snapshot_id: Optional[int] = Field(alias="parent-snapshot-id", default=None) | ||
sequence_number: Optional[int] = Field(alias="sequence-number", default=None) | ||
# cannot import `INITIAL_SEQUENCE_NUMBER` due to circular import | ||
sequence_number: Optional[int] = Field(alias="sequence-number", default=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason the default value for the sequence number has to be changed to 0 as opposed to None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the spec, https://iceberg.apache.org/spec/#version-2
Snapshot JSON:
sequence-number was added and is required; default to 0 when reading v1 metadata
Also added this in the PR description
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kevinjqliu Thanks for spotting this! We definitely need to read snapshot.sequence_number as 0 for v1. However, as we have observed in the test outcome, making sequence_number
default to 0 here leads to sequence_number=0
be written to version 1 table metada's snapshots, which is not allowed by spec:
Writing v1 metadata:
Snapshot field sequence-number should not be written
I think we may need a new field_serializer in TableMetadataCommonFields
class or some other ways to correct the behavior on write. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I missed the part about the V1 spec. Following your suggestion, I added a field_serializer
for TableMetadataCommonFields
. This will ensure that the Snapshot
pydantic object will not have the sequence-number
field for V1 format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kevinjqliu Thanks for working on this! It looks geat.
pyiceberg/table/__init__.py
Outdated
# imitates `addPreviousFile` from Java | ||
# https://github.com/apache/iceberg/blob/8248663a2a1ffddd2664ea37b45882455466f71c/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1450-L1451 | ||
metadata_log_entries = self.tbl.metadata.metadata_log + [ | ||
MetadataLogEntry(metadata_file=self.tbl.metadata_location, timestamp_ms=self.tbl.metadata.last_updated_ms) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this line acts more like https://github.com/apache/iceberg/blob/8a70fe0ff5f241aec8856f8091c77fdce35ad256/core/src/main/java/org/apache/iceberg/MetadataLogEntriesTable.java#L62-L66.
Just curious the reason that you mention addPreviousFile
here, which seem to be more relevant when we update the metadata_log during table commit.
BTW, this reminds me that currently non-rest catalog does not update the metadata_log field during commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch! I link the wrong code. I was trying to find all the places where the metadata log was modified
pyiceberg/table/snapshots.py
Outdated
parent_snapshot_id: Optional[int] = Field(alias="parent-snapshot-id", default=None) | ||
sequence_number: Optional[int] = Field(alias="sequence-number", default=None) | ||
# cannot import `INITIAL_SEQUENCE_NUMBER` due to circular import | ||
sequence_number: Optional[int] = Field(alias="sequence-number", default=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kevinjqliu Thanks for spotting this! We definitely need to read snapshot.sequence_number as 0 for v1. However, as we have observed in the test outcome, making sequence_number
default to 0 here leads to sequence_number=0
be written to version 1 table metada's snapshots, which is not allowed by spec:
Writing v1 metadata:
Snapshot field sequence-number should not be written
I think we may need a new field_serializer in TableMetadataCommonFields
class or some other ways to correct the behavior on write. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kevinjqliu Thanks for the quick update! Just have one minor comment for the test. Overall it looks great!
Merged! @kevinjqliu Thanks for another great metadata table work! Thanks @corleyma @syun64 @Fokko for reviewing! |
Resolves #594 (and part of #511)
This PR creates a metadata table for "Metadata Log Entries", similar to its spark equivalent (
metadata_log_entries
).To query the metadata table, use
References
MetadataLogEntriesTable.java
Metadata log, append the latest metadata
The metadata log entries append the latest
TableMetadata
in its operation (code).This leads to a surprising behavior where the last row of the metadata entries table is based on when the query ran.
For example,
Snapshot
sequence-number
default valueThere's an issue with reading V1 spec where the
sequence-number
isNone
instead of0
. According to the Iceberg spec, when reading v1 metadata for v2, theSnapshot
fieldsequence-number
must default to 0 (source).Similarly when writing V1 spec from V2, the
sequence-number
should not be written. This is achieved by the newfield_serializer
function inTableMetadataCommonFields
.